Consistent selection via the Lasso for high dimensional approximating
regression models

Florentina Bunea

Florida State University 

Abstract: In this article we investigate consistency of selection in regression
models via the popular Lasso method. Here we depart from the traditional
linear regression assumption and consider approximations of the regression
function $f$ with elements of a given dictionary of $M$ functions. The target for
consistency is the index set of those functions from this dictionary that realize
the most parsimonious approximation to $f$ among all linear combinations
belonging to an $L_2$ ball centered at $f$ and of radius $r_{n,M}^2$. In this framework we
show that a consistent estimate of this index set can be derived via $\ell_1$
penalized least squares, with a data dependent penalty and with tuning sequence
$r_{n,M} > \sqrt{\log(Mn)/n}$, where $n$ is the sample size. Our results hold for any
$1 \le M \le n^{\gamma}$, for any $\gamma > 0$.

1. Introduction

In this paper we show that the popular Lasso technique can be used for consistent
feature selection in high dimensional approximating regression models. We consider
the following framework. Given a random pair $(X, Y)$, we let $f(x) = E(Y \mid X = x)$
be the conditional mean function, henceforth called the regression function. We aim
to reconstruct consistently a sparse approximation of $f$ via linear combinations of
elements of a given dictionary of functions $\mathcal{F} = \{f_1, \ldots, f_M\}$. This reconstruction
will be based on $(X_1, Y_1), \ldots, (X_n, Y_n)$, a sample of independent random pairs
distributed as $(X, Y) \in (\mathcal{X}, \mathbb{R})$, where $\mathcal{X}$ is a Borel subset of $\mathbb{R}^d$; all functions $f_j$ are
defined on $\mathcal{X}$. Our aim expresses the belief that, in many instances, even if $M$ is
large, only a subset of $\mathcal{F}$ may be needed to approximate $f$ well. If that is the case, it
may be of interest to determine whether this set can be estimated consistently via a
computationally efficient method. The focus of this work is on consistent selection
via the Lasso when the size $M$ of $\mathcal{F}$ grows polynomially with the sample size $n$, that
is $M = n^{\gamma}$, for any $\gamma > 0$.

We begin by giving a number of examples of dictionaries $\mathcal{F}$ and associated
consistency issues.

1. If $d = M$ and $f_j(X) = X_j$ for all $j$, one may be interested in identifying the
subset of variables with linear combinations close to $f$. A familiar particular case
is linear regression, where one assumes that $f(X) = \lambda' X$, with $\lambda \in \mathbb{R}^M$ having
non-zero components in positions corresponding to a set $J^* \subseteq \{1, \ldots, M\}$. Here
we depart from this traditional equality assumption and consider the more
realistic case where $f$ is not equal to, but can be well approximated by a linear
combination of the given variables. We discuss this in detail in the next section.

2. Another problem of interest is that of finding consistently a sparse linear
approximation of $f$ realized with elements from a large list of $M$ possibly competing
estimators. These estimates may correspond to $M$ different methods of
estimation, may be computed from $M$ different samples with the same mean function,
or may correspond to $M$ different values of a tuning parameter of the same
method. Instances of the latter arise in kernel based methods that require the
choice of a grid of values for the bandwidth parameter or in Bayesian methods,
where the specification of a grid of values for hyper-parameters is needed. A
consistent identification of a subset of the estimates in these examples would
validate the use of a particular restriction on an initially large grid. In such
situations, when the elements of $\mathcal{F}$ are estimators, we will assume that they have
been computed on samples independent of the one used for subset selection and
treat them here as fixed functions.

3. A last example is the nonparametric estimation of $f$ from a collection of $M$ given
basis functions, where only a subset may realize a good approximation of $f$, as
described in the following subsection.

There exist a number of model selection methods that yield consistent subset 
selection in regression models. In discussing them a number of distinctions are 
needed. 

The first one pertains to the evolution of the literature on model selection
techniques in regression. One important cut-off point in this evolution seems to be the
computational complexity of a particular method and, within this, the size of $M$
relative to $n$ plays a crucial role. If $M < n$, procedures based on various information
criteria occupy an important place. They are referred to now as the BIC/AIC-type
methods; we mention here the seminal works of ([1], [15]), the unifying theory of
[2], and various generalizations of these methods ([4], [7]). Such procedures can
be easily implemented for small to moderate $M$. For larger values of $M$ multiple
testing procedures, in particular of the FDR type (e.g., [3], [9]), or cross-validation
with all its variants (holdout validation, $K$-fold) [21], are popular, but become
more computationally complex as $M$ increases. If $M > n$ these techniques may
become computationally intractable, unless they are used as part of a multiple-stage
scheme. For a further overview on computational aspects in model selection, from
a Bayesian perspective, see [11].

Whereas the above mentioned methods can still be used in very particular
regression models when $M > n$, for instance, for sequence-space models, where model
selection via BIC is equivalent to hard thresholding, they typically fail,
computationally, when $M$ is large. A standard solution in this case is to seek estimates that
solve a certain class of convex optimization problems. Among the most popular
estimates of this type in regression is the penalized least squares estimate with
an $\ell_1$-type penalty (Lasso), which we describe in detail in the next section. In a
Bayesian framework it can be derived from a Gaussian likelihood with a Laplace
prior. Two important aspects set the regularized (Lasso) type estimators apart:
they are easy and fast to compute; see [y], [13], [14], [lb], among others, for efficient
algorithms; and, if $M > n$, some components of the estimate will be set to zero,
in finite samples, see, e.g., [13]. Therefore, via a one-step easily implementable
procedure, one obtains subset selection even if $M > n$. To date, this method (or its
variants) is the most widely used in regression problems of very high dimension,
especially when dimension reduction is of interest.

The second distinction in discussing consistency of selection in regression is
related to the target for consistency. Consistency of selection has been studied for all
aforementioned techniques only in the following context, which we term parametric:
the target for selection is typically an index set $J^*$ corresponding to the non-zero
true regression coefficients, whereas the remaining coefficients are assumed to be
exactly zero. An estimation method that uses the data and all $M$ elements $f_j$ to
yield a subset $\hat{I}$ of indices such that $P(\hat{I} = J^*) \to 1$ for large $n$ is called a consistent
method of selection.

In light of these two distinctions we give below a summary of the existing re- 
sults on consistency of selection. They have all been established for the traditional 
parametric target J*. 

If M < n and under appropriate assumptions all the above methods, or close 
variants, yield consistent subset selection for the parametric target J*. References 
include those for AIC/BIC-type methods ([4], [10], and [22], among others), multiple 
testing procedures [•"•], cross-validation procedures [Ki], and Lasso-type procedures 
[24]. 

If $M > n$ consistency of selection has only been studied for Lasso-type estimators.
Again, in the existing literature, the target is the standard target $J^*$. The results are
limited. Meinshausen and Bühlmann [12] showed that $P(\hat{I} = J^*) \to 1$ in Gaussian
graphical models, under assumptions that are tailored to models for which, in our
notation, $(Y, X_1, \ldots, X_M) \sim N(0, \Sigma)$. Consistency of selection has been established
when $M > n$, for fixed design linear regression models and a target set $J^*$ that
corresponds to coefficients $\lambda^*$ that are assumed to be lower bounded by a sequence
of order $O(n^{-\delta/2})$, for $0 < \delta < 1$ [23]. Similar results, under slightly different
assumptions, have also been obtained for a three stage procedure [20]: in the first
stage Lasso estimates are computed for a number of values of the tuning parameter,
in the second stage cross-validation is performed to select one Lasso estimate, and
in the third stage the model is refitted on the variables present in the selected Lasso
estimate. We also refer to a related notion of consistency, in fixed design regression
with Gaussian errors [19].

If M > n consistent subset selection via the Lasso has not been investigated, to 
the best of our knowledge, in the general framework we describe in detail below. 
Within this framework, we extend the existing results to more general regression 
models on a random design and a more general target index set. 






1.1. Beyond linear regression 

Despite its practical appeal, the study of selection procedures that are consistent
for target sets other than the classical one has received very little attention. Our
target set will be defined relative to linear approximations of $f$ with elements of $\mathcal{F}$
with respect to the $L_2(\nu)$ norm $\|\cdot\|$, where we denote the probability measure of
$X$ by $\nu$.

Formally, define

(1.1)    $\Lambda = \left\{ \lambda \in \mathbb{R}^M : \left\| \sum_{j=1}^M \lambda_j f_j - f \right\|^2 \le C_f r_{n,M}^2 \right\}$,

where $C_f > 0$ is a constant depending only on $f$ and $r_{n,M}$ is a positive sequence
that converges to zero and which will be specified in the next section. In what
follows we assume that $\Lambda$ is not empty. For any $\lambda \in \mathbb{R}^M$ we let $J(\lambda)$ denote the
index set corresponding to the non-zero components of $\lambda$ and denote by $M(\lambda)$ its
cardinality. Let $k^* = \min\{M(\lambda) : \lambda \in \Lambda\}$. We define our target vector

(1.2)    $\lambda^* = \arg\min \left\{ \left\| \sum_{j=1}^M \lambda_j f_j - f \right\|^2 : \lambda \in \mathbb{R}^M,\ M(\lambda) = k^* \right\}$.

Let $I^* = J(\lambda^*)$ denote the index set corresponding to the non-zero elements of $\lambda^*$
and note that $I^*$ has cardinality $k^*$. Thus $f^* = \sum_{j \in I^*} \lambda_j^* f_j$ provides the sparsest
approximation to $f$ that can be realized with $\lambda \in \Lambda$ and, in particular,

(1.3)    $\|f^* - f\|^2 \le C_f r_{n,M}^2$.

This motivates our treating $I^*$ as the target index set.

We note that if one assumes, as in standard linear regression models, that $f(x) =
\sum_{j=1}^M \lambda_j x_j = \sum_{j \in I^*} \lambda_j^* x_j = f^*(x)$, where $\lambda^*$ denotes the non-zero components of
$\lambda$, then (1.3) is trivially satisfied for any positive sequence $r_{n,M}$. Therefore, the
classical target $J^*$ is a particular case of ours.
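
For a small dictionary, the target defined in (1.1) and (1.2) can be found by exhaustive search, which may help build intuition. The sketch below is only an illustration under stated assumptions: the $L_2(\nu)$ norm is replaced by an empirical average over sample points, and the names `target_index_set`, `F`, `f_vals` and `radius_sq` are hypothetical (`radius_sq` plays the role of $C_f r_{n,M}^2$). The search is exponential in $M$, so it is not a substitute for the Lasso.

```python
from itertools import combinations

import numpy as np


def target_index_set(F, f_vals, radius_sq):
    """Smallest index set whose least-squares fit lies within radius_sq of f.

    F         : (n, M) array; column j holds f_j evaluated at the sample points
    f_vals    : (n,) array holding f at the same points
    radius_sq : stand-in for C_f * r_{n,M}^2 (assumption: empirical squared norm)
    Returns (indices, coefficients), or None if no subset meets the threshold.
    """
    n, M = F.shape
    # k = 0: the zero function is already close enough.
    if np.mean(f_vals ** 2) <= radius_sq:
        return (), np.zeros(0)
    for k in range(1, M + 1):
        best = None
        for idx in combinations(range(M), k):
            cols = F[:, list(idx)]
            # For a fixed support, least squares gives the smallest error.
            coef, *_ = np.linalg.lstsq(cols, f_vals, rcond=None)
            err = np.mean((cols @ coef - f_vals) ** 2)
            if err <= radius_sq and (best is None or err < best[0]):
                best = (err, idx, coef)
        if best is not None:
            # First k with a feasible subset is k*; keep the best such subset.
            return best[1], best[2]
    return None
```

Stopping at the first cardinality that admits a feasible subset mirrors the definition of $k^*$, and choosing the smallest error among subsets of that size mirrors the minimization in (1.2).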

In order to ensure that $\lambda^*$ captures the essential features of $f$ in a parsimonious
way we require that its components not be unnecessarily small, otherwise we can
place their indices outside $I^*$. Formally, we will require that the following condition
holds.

Condition (C). There exists $B > 0$, independent of $n$ or $M$, such that

    $\min_{j \in I^*} |\lambda_j^*| \ge B r_{n,M}$.

We show below that $\ell_1$ penalized least squares can be used to estimate
consistently the new target $I^*$, even if $M$ is larger than $n$, in particular if it grows as $n^{\gamma}$,
for any $\gamma > 0$, under minimal assumptions on the dictionary $\mathcal{F}$ and appropriate
choices for $r_{n,M}$. In Section 2 below we introduce the estimate and discuss these
choices. Section 2.1 contains our main result, Theorem 2.1, together with a
discussion of the assumptions under which it holds. The proof of the main result is given
in Section 2.2 and intermediate results are proved in the Appendix.

2. Consistent selection via $\ell_1$ penalized least squares

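We estimate the set $I^*$ of the previous section via $\ell_1$ penalized least squares. We
first compute

(2.4)    $\hat{\lambda} = \arg\min_{\lambda \in \mathbb{R}^M} \left\{ \frac{1}{n} \sum_{i=1}^n \left\{ Y_i - \sum_{j=1}^M \lambda_j f_j(X_i) \right\}^2 + \mathrm{pen}(\lambda) \right\}$,

where

(2.5)    $\mathrm{pen}(\lambda) = 2 \sum_{j=1}^M \omega_{n,j} |\lambda_j|$ with $\omega_{n,j} = r_{n,M} \|f_j\|_n$,

for a sequence $r_{n,M}$ given below, where we write $\|g\|_n^2 = \frac{1}{n} \sum_{i=1}^n g^2(X_i)$ for any
function $g : \mathcal{X} \to \mathbb{R}$. We note that each $\lambda_j$ in the penalty term has a different,
data-dependent, weight. The estimate $\hat{\lambda}$ thus obtained is in one-to-one correspondence
with the following estimate. For each $1 \le j \le M$ define $\theta_j = 2\omega_{n,j}\lambda_j$ and let
$A$ be the $M \times M$ diagonal matrix with diagonal entries $2\omega_{n,j}$. Next observe that
$F\lambda = F_1\theta$, where $F$ is the $n \times M$ matrix with entries $f_j(X_i)$, $F_1 = FA^{-1}$ and
$\theta = A\lambda$. Thus, denoting by $Y$ the $n$ dimensional vector with entries $Y_i$, the problem
reduces to calculating

    $\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^M} \left\{ \frac{1}{n} (Y - F_1\theta)'(Y - F_1\theta) + \sum_{j=1}^M |\theta_j| \right\}$,

for which the aforementioned fast algorithms can be used. Then, we compute our
sought solution $\hat{\lambda} = A^{-1}\hat{\theta}$.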
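
The change of variables $\theta = A\lambda$ can be sketched in a few lines. The code below is a minimal illustration, not the algorithm of any of the cited references: it solves the unit-penalty problem by plain cyclic coordinate descent (our choice for brevity), and the function names `lasso_unit_penalty` and `weighted_lasso` are hypothetical. It assumes $r_{n,M} > 0$ and that no column of $F$ is identically zero.

```python
import numpy as np


def soft_threshold(z, t):
    """Scalar soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_unit_penalty(F1, Y, n_iter=500):
    """Cyclic coordinate descent for (1/n)||Y - F1 theta||^2 + sum_j |theta_j|."""
    n, M = F1.shape
    theta = np.zeros(M)
    resid = Y.astype(float).copy()          # current residual Y - F1 theta
    col_sq = (F1 ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(M):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the partial residual (column j removed).
            rho = F1[:, j] @ resid + col_sq[j] * theta[j]
            # Stationarity of the j-th coordinate gives a threshold of n/2.
            new = soft_threshold(rho, n / 2.0) / col_sq[j]
            resid += F1[:, j] * (theta[j] - new)
            theta[j] = new
    return theta


def weighted_lasso(F, Y, r):
    """Estimate of (2.4)-(2.5) via the change of variables theta = A lambda."""
    omega = r * np.sqrt((F ** 2).mean(axis=0))   # omega_{n,j} = r * ||f_j||_n
    a = 2.0 * omega                               # diagonal entries of A
    theta = lasso_unit_penalty(F / a, Y)          # F_1 = F A^{-1}
    return theta / a                              # lambda = A^{-1} theta
```

The estimated index set is then the set of non-zero entries of the returned vector, for example `np.flatnonzero(weighted_lasso(F, Y, r))`. Any other solver for the unit-penalty Lasso could be substituted for `lasso_unit_penalty`.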

We let $\hat{I}$ denote the index set corresponding to the non-zero components of $\hat{\lambda}$.
We show in the next subsection that $P(\hat{I} = I^*) \to 1$ when $n \to \infty$. We begin by
noticing that we always have

    $P(\hat{I} = I^*) \ge 1 - P(I^* \not\subseteq \hat{I}) - P(\hat{I} \not\subseteq I^*)$.

Therefore, proving that $\hat{I}$ is consistent reduces to showing that each of the
probabilities in the right-hand side of the inequality above converges to zero. In what
follows we motivate choices for the sequence $r_{n,M}$ that stem from sufficient
conditions under which this convergence is achieved. The proofs are presented in the
next section.

We begin by noticing that if $\hat{\lambda} \to \lambda^*$ component-wise, with probability converging to one,
then $I^* \not\subseteq \hat{I}$ with probability converging to zero. To see this, further note that
if component-wise consistency of $\hat{\lambda}$ holds, we will estimate all non-zero elements of
$\lambda^*$ by non-zero sequences, but we may also estimate some of its zero components
by some small, but non-zero sequences. In light of this fact, a first set of restrictions
on $r_{n,M}$ will be such that $\hat{\lambda}$ is close to $\lambda^*$, in the sense below. It follows immediately
(by [6], Theorem 2.3; see the Appendix below for a full formulation) that, with high
probability

    $r_{n,M} |\hat{\lambda} - \lambda^*|_1 \le D \left\{ \|f - f^*\|^2 + k^* r_{n,M}^2 \right\}$,

for some positive constant $D$, and where $|a|_1 = \sum_{j=1}^M |a_j|$ denotes the $\ell_1$ norm of
any vector in $\mathbb{R}^M$. Next, notice that the optimal parametric rate of convergence
for a component $\lambda_j$ of $\lambda$ is of order $1/\sqrt{n}$, and it could be achieved if we knew $I^*$ of
cardinality $k^* < M$ in advance. However, this is not known, so the best we can do is
mimic this behavior in our context. We can do this by choosing $r_{n,M}$ of order $1/\sqrt{n}$,
where we recall that we have assumed that $\|f - f^*\|^2 \le C_f r_{n,M}^2$. Notice further that
this choice is optimal for the rate of convergence of $\hat{\lambda}$, which is not the focus here.
Indeed, more modest rates of convergence of $\hat{\lambda}$ can be considered when consistency
of selection is of main importance. We discuss in detail two concrete choices, and
defer a complete analysis for future work.

One can consider $r_{n,M} = A\sqrt{\log(Mn)/n}$, for an appropriately large constant
$A > 0$. Notice that this choice differs from the one that yields the optimal rate
only by logarithmic factors, which are needed to accommodate dictionaries with
$M > n$. With this choice, the target set $I^*$ corresponds to linear combinations of
the elements of $\mathcal{F}$ that belong to, up to logarithmic factors, a $1/\sqrt{n}$ neighborhood
of $f$, with respect to the $L_2(\nu)$ norm. This provides only a slight departure from the
standard linear model assumption and standard target index set $J^*$. It is therefore
not surprising that, in this case, our tuning sequence $r_{n,M}$ is also comparable to
the one considered in parametric models ([12], [23]), where a sequence of the order
of $1/n^{1/2-\alpha}$, $\alpha \in (0, 1/2)$, is employed. We note that this choice is slightly
conservative, and can be relaxed to $O(\sqrt{\log(Mn)/n})$ in our framework, and therefore, as
a particular case, in theirs.

In order to accommodate consistent selection in a purely nonparametric
framework we need to increase the size of $r_{n,M}$. For instance, if all $f_j$ are estimates of $f$,
and $r_{n,M}$ is as before, the set $\Lambda$ defined in (1.1) may be empty, as non-parametric
estimates of $f$ have typically slower rates than $1/\sqrt{n}$. We therefore consider target
sets $I^*$ corresponding to $L_2(\nu)$ neighborhoods around $f$ of radius $r_{n,M}^2$, now with
$r_{n,M} = O((\log(Mn)/n)^{1/4})$. In this case, the set $\Lambda$ given in (1.1) above is not
empty if at least one of the estimators $f_j$ has, up to logarithmic factors, a rate
of the order $n^{-1/4}$, which is a modest rate to require. Of course, if $f_j(X) = X_j$,
as in linear regression, this choice means that we may be content with a coarser
approximation than before. However, note that this approximation has the benefit
of being realized with a smaller number of variables and that this may increase the
interpretability of that particular model and be a desirable property in practical
situations.

The results presented below hold for either of these choices, in particular for any
$r_{n,M} \ge A\sqrt{\log(Mn)/n}$, and we will therefore not distinguish between them.
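
To get a feel for the magnitudes involved, the two tuning sequences can be evaluated numerically. This is a small sketch: the function names are hypothetical, and the constant `A` in the second sequence is an assumption, since the text only specifies that sequence up to order.

```python
import math


def r_parametric(n, M, A=1.0):
    """First choice: r_{n,M} = A * sqrt(log(M n) / n)."""
    return A * math.sqrt(math.log(M * n) / n)


def r_nonparametric(n, M, A=1.0):
    """Second choice, up to the assumed constant A: r_{n,M} = A * (log(M n) / n)^(1/4)."""
    return A * (math.log(M * n) / n) ** 0.25


n, M = 1000, 2000   # an example with M > n
print(r_parametric(n, M), r_nonparametric(n, M))
```

With `A = 1` the second sequence is the square root of the first, so it is the larger of the two whenever $\log(Mn)/n < 1$, in line with the coarser approximation it targets.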

2.1. Main result: consistent subset selection 

We begin by listing and commenting on the assumptions under which our result
holds. The first assumption refers to the error terms $W_i = Y_i - f(X_i)$. We recall
that $f(X) = E(Y \mid X)$.

Assumption (A1). The random variables $X_1, \ldots, X_n$ are independent, identically
distributed random variables with probability measure $\nu$. The random variables $W_i$
are independently distributed with

    $E\{W_i \mid X_1, \ldots, X_n\} = 0$

and

    $E\{\exp(|W_i|) \mid X_1, \ldots, X_n\} \le b$ for some finite $b > 0$ and $i = 1, \ldots, n$.

We also impose mild conditions on $f$ and on the functions $f_j$. Let $\|g\|_\infty =
\sup_{x \in \mathcal{X}} |g(x)|$ for any function $g$ on $\mathcal{X}$.

Assumption (A2).

(a) There exists $0 < L < \infty$ such that $\|f_j\|_\infty \le L$ for all $1 \le j \le M$.

(b) There exists $c_0 > 0$ such that $\|f_j\| \ge c_0$ for all $1 \le j \le M$.

(c) There exists $L_0 < \infty$ such that $E[f_i^2(X) f_j^2(X)] \le L_0$ for all $1 \le i, j \le M$.

(d) There exists $L_1 < \infty$ such that $\|f\|_\infty \le L_1$.

(e) There exists $L^* < \infty$ such that $\|f - f^*\|_\infty \le L^*$.

Remark 2.1. We note that (a) trivially implies (c). However, as the implied bound
may be too large, we opted for stating (c) separately. Note also that (a) and (d)
imply the following: for any fixed $\lambda \in \mathbb{R}^M$, there exists a positive constant $L(\lambda)$,
depending on $\lambda$, such that $\|f - \sum_{j=1}^M \lambda_j f_j\|_\infty = L(\lambda)$. Inspection of the proof of
Theorem 2.1 below shows that we can allow $L^*$ to grow very slowly with $n$. However,
for the sake of clarity in presentation we opted for treating it as fixed.

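Assumption (A3). Let

    $\rho_M(i, j) = \frac{\langle f_i, f_j \rangle}{\|f_i\| \, \|f_j\|}$,

where $\langle f_i, f_j \rangle = E f_i(X) f_j(X)$ and $\|f_i\| = E^{1/2} f_i^2(X)$. Assume that

    $\max_{i \in I^*} \max_{j \ne i} |\rho_M(i, j)| \le \frac{C}{k^*}$,

for some constant $C > 0$.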

Remark 2.2. Following [6], $C = 1/45$ is an allowable choice. Other choices are
possible, but improvement of constants is beyond the scope of this paper.

Remark 2.3. Assumption (A3) reflects the belief that the correlations between
functions $f_j$ with $j \in I^*$ and functions $f_j$ with $j \notin I^*$ should be small. However,
we allow the correlations outside $I^*$ to be arbitrary. We note that this assumption
replaces the standard orthonormality assumption on the design matrix: it is given
in terms of theoretical quantities and it can hold even if $M > n$. It can be checked
in practice by replacing the theoretical correlations by sample correlations.
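
A sample version of the check mentioned in Remark 2.3 might look as follows. This is a hedged sketch: the function name `coherence_check` is hypothetical, the candidate index set `support` must be supplied by the user (the true $I^*$ is unknown in practice), and the inner products are uncentered sample averages, mirroring $\langle f_i, f_j \rangle = E f_i(X) f_j(X)$.

```python
import numpy as np


def coherence_check(F, support, C=1.0 / 45.0):
    """Sample analogue of (A3) for a candidate index set.

    F       : (n, M) array with entries f_j(X_i); columns must be non-zero
    support : list of column indices standing in for I* (an assumption)
    C       : constant of (A3); 1/45 is the value quoted in Remark 2.2
    Returns (largest |sample correlation|, whether it is <= C / len(support)).
    """
    n, M = F.shape
    norms = np.sqrt((F ** 2).mean(axis=0))           # ||f_j||_n
    R = (F.T @ F) / n / np.outer(norms, norms)       # sample version of rho_M(i, j)
    np.fill_diagonal(R, 0.0)                         # exclude the case j = i
    rho_max = float(np.abs(R[list(support), :]).max())
    return rho_max, bool(rho_max <= C / len(support))
```

Only the rows indexed by `support` are inspected, so correlations among columns outside the candidate set do not affect the outcome, as the assumption allows.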

We denote by $G$ the event that the $n \times M$ matrix $F$ with entries $f_j(X_i)$ has full
rank. To avoid additional technicalities, the results of this paper can be regarded
as conditional on $G$. Otherwise, all the results can be re-derived by intersecting all
the relevant events with $G$ and $G^c$, under the additional assumption that $P(G^c)$ is
appropriately small.

We can now state our main result which we prove in the next subsection. 

Theorem 2.1. If assumptions (A1)-(A3) and condition (C) hold, and $k^* r_{n,M} \to 0$,
then $P(\hat{I} \ne I^*) \to 0$.

Remark 2.4. The convergence above holds either if $M$ is fixed and $n \to \infty$ or if
both $M, n \to \infty$, if $r_{n,M} \ge A\sqrt{\log(Mn)/n}$ for an appropriately large constant $A$.
Therefore we obtain consistency for both choices of $r_{n,M}$ discussed above. In our
derivations we require that $M$ does not grow faster than a power of $n$.

Remark 2.5. The condition $r_{n,M} k^* \to 0$ imposes restrictions on the size of $k^*$.
If $r_{n,M} = O(\sqrt{\log(Mn)/n})$ the theorem above shows that we can recover
consistently subsets of size $k^* = O(\sqrt{n}/\log n)$, up to other logarithmic factors. The
choice $r_{n,M} = O((\log(Mn)/n)^{1/4})$ corresponds to a coarser approximation of $f$
than before, and the restriction on the number of approximating functions is now
$k^* = O(n^{1/4}/\log n)$.

2.2. Proof of Theorem 2.1 

Recall that

    $P(\hat{I} = I^*) \ge 1 - P(I^* \not\subseteq \hat{I}) - P(\hat{I} \not\subseteq I^*)$.

Therefore, proving that $\hat{I}$ is consistent reduces to showing that each of the
probabilities in the right-hand side of the inequality above converges to zero. We present
this in the following two propositions. We defer the proof of the intermediate results
to the Appendix.

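Proposition 2.2. If assumptions (A1)-(A3) and condition (C) hold, and
$r_{n,M} k^* \to 0$, then $P(I^* \not\subseteq \hat{I}) \to 0$ as $n \to \infty$, for any $r_{n,M} \ge A\sqrt{\log(Mn)/n}$,
with $A > 0$ large enough.

Proof. We follow the same reasoning as [4]. Let $c_n = \min_{j \in I^*} |\lambda_j^*|$ and recall that
$c_n \ge B r_{n,M}$, by condition (C). Therefore

    $P(I^* \not\subseteq \hat{I}) \le P(j \notin \hat{I} \text{ for some } j \in I^*)$

    $\le P(|\hat{\lambda}_j - \lambda_j^*| = |\lambda_j^*| \text{ for some } j \in I^*)$

    $\le P(|\hat{\lambda}_j - \lambda_j^*| \ge c_n \text{ for some } j \in I^*) \to 0$, as $n \to \infty$,

where, in the second inequality, we used that $\hat{\lambda}_j = 0$ for $j \notin \hat{I}$, by the definition of $\hat{I}$.
The last inequality follows from Corollary 1 presented in the Appendix below. □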

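Proposition 2.3. If assumptions (A1)-(A3) hold and $r_{n,M} k^* \to 0$, then
$P(\hat{I} \not\subseteq I^*) \to 0$, as $n \to \infty$, for any $r_{n,M} \ge A\sqrt{\log(Mn)/n}$, with $A > 0$ large enough.

Proof. Let

    $h(\mu) = \frac{1}{n} \sum_{i=1}^n \left\{ Y_i - \sum_{j \in I^*} \mu_j f_j(X_i) \right\}^2 + 2 r_{n,M} \sum_{j \in I^*} \|f_j\|_n |\mu_j|$

and define

(2.6)    $\tilde{\mu} = \arg\min_{\mu \in \mathbb{R}^{k^*}} h(\mu)$.

Let

    $\mathcal{B} = \bigcap_{k \notin I^*} \left\{ \left| \frac{2}{n} \sum_{i=1}^n \left[ Y_i - \sum_{j \in I^*} \tilde{\mu}_j f_j(X_i) \right] f_k(X_i) \right| < 2 r_{n,M} \|f_k\|_n \right\}$.

Let $\tilde{\lambda} \in \mathbb{R}^M$ be the vector that has the components of $\tilde{\mu}$ in positions corresponding
to the index set $I^*$ and components equal to zero otherwise. Thus, by abuse of
notation, $\tilde{\lambda} = (\tilde{\mu}, 0)$. From Lemma 3.4 in the Appendix it follows that, on the set
$\mathcal{B}$, $\tilde{\lambda}$ is a solution of (2.4). Recall that $\hat{\lambda}$ is a solution of (2.4) by construction. Then,
by arguments similar to those used in ([13], Theorems 3.1 and 3.2) regarding the
closeness of two solutions it follows that, on the set $\mathcal{B}$, $\hat{\lambda}_k = 0$ for $k \in I^{*c}$. Therefore
$\hat{I} \subseteq I^*$ on the set $\mathcal{B}$. Hence

    $P(\hat{I} \not\subseteq I^*) \le P(\mathcal{B}^c)$

    $= P\left( \bigcup_{k \in \{1,\ldots,M\} \setminus I^*} \left\{ \left| \frac{2}{n} \sum_{i=1}^n \left[ Y_i - \sum_{j \in I^*} \tilde{\mu}_j f_j(X_i) \right] f_k(X_i) \right| \ge 2 r_{n,M} \|f_k\|_n \right\} \right)$

    $\le \sum_{k \in \{1,\ldots,M\} \setminus I^*} P\left( \left| \frac{2}{n} \sum_{i=1}^n \left[ Y_i - \sum_{j \in I^*} \tilde{\mu}_j f_j(X_i) \right] f_k(X_i) \right| \ge 2 r_{n,M} \|f_k\|_n \right)$.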

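Let $k \in \{1, \ldots, M\} \setminus I^*$ be fixed. Define the sets

    $E_1(k) = \left\{ \left| \frac{1}{n} \sum_{i=1}^n W_i f_k(X_i) \right| < r_{n,M} \|f_k\|_n / 2 \right\}$,

    $E_2(k) = \left\{ \|f_k\|_n \ge \frac{1}{2} \|f_k\| \right\}$,

    $E_3(k) = \left\{ \left| \frac{1}{n} \sum_{i=1}^n f_j(X_i) f_k(X_i) \right| < 2 |\langle f_j, f_k \rangle| + \delta_{n,M},\ j \in I^* \right\}$,

where $\delta_{n,M} = 2CL^2 r_{n,M}$ will be specified below. The choice of $\delta_{n,M}$ is purely
technical and does not affect the overall results.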

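Let $\tilde{f} = \sum_{j \in I^*} \tilde{\mu}_j f_j$. Recall that $\lambda^* \in \mathbb{R}^M$ given by (1.2) has zero components in
positions corresponding to indices in $I^{*c}$, by definition. Let $\mu^*$ be the vector in $\mathbb{R}^{k^*}$
obtained from $\lambda^*$ by deleting these zeros. Therefore $f^* = \sum_{j=1}^M \lambda_j^* f_j = \sum_{j \in I^*} \mu_j^* f_j$.
By successive applications of the triangle inequality and since $\|f_k\|_\infty \le L$, for all
$k \in I^{*c}$, by assumption (A2)(a), we obtain:

(2.7)    $P\left( \left| \frac{1}{n} \sum_{i=1}^n \left[ Y_i - \sum_{j \in I^*} \tilde{\mu}_j f_j(X_i) \right] f_k(X_i) \right| \ge r_{n,M} \|f_k\|_n \right)$

    $\le P\left( \left| \frac{1}{n} \sum_{i=1}^n W_i f_k(X_i) \right| \ge r_{n,M} \|f_k\|_n / 2 \right)$
      $+ P\left( \left| \frac{1}{n} \sum_{i=1}^n (f(X_i) - \tilde{f}(X_i)) f_k(X_i) \right| \ge r_{n,M} \|f_k\|_n / 2 \right)$

    $\le P(E_1^c(k)) + P\left( \left| \frac{1}{n} \sum_{i=1}^n (f^*(X_i) - \tilde{f}(X_i)) f_k(X_i) \right| \ge r_{n,M} \|f_k\|_n / 4 \right)$
      $+ P\left( \frac{1}{n} \sum_{i=1}^n |f(X_i) - f^*(X_i)| \ge r_{n,M} \|f_k\|_n / (4L) \right)$

    $\le P(E_1^c(k)) + P\left( \left| \sum_{j \in I^*} (\tilde{\mu}_j - \mu_j^*) \frac{1}{n} \sum_{i=1}^n f_j(X_i) f_k(X_i) \right| \ge r_{n,M} \|f_k\|_n / 4 \right)$
      $+ P\left( \frac{1}{n} \sum_{i=1}^n |f(X_i) - f^*(X_i)| \ge r_{n,M} \|f_k\|_n / (4L) \right)$.

To bound the second term in the last inequality above we first notice that on the
set $E_3(k)$ and under assumptions (A2)(a) and (A3) we have

    $\left| \sum_{j \in I^*} (\tilde{\mu}_j - \mu_j^*) \frac{1}{n} \sum_{i=1}^n f_j(X_i) f_k(X_i) \right|$

    $\le 2 \sum_{j \in I^*} |\tilde{\mu}_j - \mu_j^*| \, |\langle f_j, f_k \rangle| + \delta_{n,M} \sum_{j \in I^*} |\tilde{\mu}_j - \mu_j^*|$

    $\le \frac{2CL^2}{k^*} |\tilde{\mu} - \mu^*|_1 + \delta_{n,M} |\tilde{\mu} - \mu^*|_1$.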

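Therefore, on $E_2(k) \cap E_3(k)$, and under assumptions (A2)(a) and (b), and (A3),
we have

    $P\left( \left| \sum_{j \in I^*} (\tilde{\mu}_j - \mu_j^*) \frac{1}{n} \sum_{i=1}^n f_j(X_i) f_k(X_i) \right| \ge r_{n,M} \|f_k\|_n / 4 \right)$

    $\le P\left( |\tilde{\mu} - \mu^*|_1 \ge \frac{c_0}{32CL^2} k^* r_{n,M} \right) + P\left( |\tilde{\mu} - \mu^*|_1 \ge \frac{c_0}{16} r_{n,M} \delta_{n,M}^{-1} \right)$

(2.8)    $\le 2 P\left( |\tilde{\mu} - \mu^*|_1 \ge \frac{c_0}{32CL^2} k^* r_{n,M} \right)$,

for $n$ large enough, since the assumption $k^* r_{n,M} \to 0$ implies that $k^* r_{n,M} < 1$ for
large $n$, and we recall that we defined $\delta_{n,M} = 2CL^2 r_{n,M}$.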

Lastly, notice that on the set $E_2(k)$ and under assumptions (A2)(b) and (e) the
third term of the last inequality in display (2.7) can be bounded by

(2.9)    $P\left( \frac{1}{n} \sum_{i=1}^n |f(X_i) - f^*(X_i)| \ge \frac{c_0}{8L} r_{n,M} \right)$.

To complete the proof we need to show that $P(E_1^c(k))$, $P(E_2^c(k))$ and $P(E_3^c(k))$
and the probabilities in (2.8) and (2.9), when summed over $k \in \{1, \ldots, M\} \setminus I^*$,
converge to zero as $n \to \infty$. We show this in Lemma 3.5, Corollary 2 and Lemma
3.6, respectively, in the Appendix below. This completes the proof of this result. □

Appendix 

In order to show Proposition 2.2 and to bound (2.8) above we will use twice ([6],
Theorem 2.3, page 177) and we begin by stating it here, for completeness. For any
$\lambda \in \mathbb{R}^M$ we let $J(\lambda)$ denote the index set corresponding to the non-zero components
of $\lambda$ and denote by $M(\lambda)$ its cardinality. Let $\rho(\lambda) = \max_{i \in J(\lambda)} \max_{j \ne i} |\rho_M(i, j)|$.
With $\Lambda$ given by (1.1) in Section 1.1, let $\Lambda_1 = \{\lambda \in \Lambda : \rho(\lambda) \le C/M(\lambda)\}$.

Theorem 2.3 ([6]). Assume that (A1) and (A2) hold. Then the $\ell_1$ penalized least
squares estimator $\hat{\lambda}$ given by (2.4) satisfies, for any $\lambda \in \Lambda_1$,

(3.10)    $P\{ |\hat{\lambda} - \lambda|_1 \le B_1 r_{n,M} M(\lambda) \} \ge 1 - \pi_{n,M}(\lambda)$,

where



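    $\pi_{n,M}(\lambda) \le 14 M^2 \exp\left( -c_1 n \min\left\{ \frac{r_{n,M}^2}{L_0}, \frac{r_{n,M}}{L^2}, \frac{1}{L_0 M^2(\lambda)}, \frac{1}{L^2 M(\lambda)} \right\} \right)$
      $+ \exp\left( -c_2 \frac{M(\lambda)}{L^2(\lambda)} n r_{n,M}^2 \right)$,

for some positive constants $c_1, c_2$ depending on $c_0$, $C_f$ and $b$ only, and a constant
$B_1$ depending on $c_0$ and $C_f$.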

Notice now that by (1.3) and under assumption (A3), $\lambda^* \in \Lambda_1$. We therefore
have the following corollary.

Corollary 1. Assume that (A1)-(A3) hold. Then

    $P\{ |\hat{\lambda}_i - \lambda_i^*| \ge B_1 r_{n,M} \} \le \pi^*$,

for all $1 \le i \le M$, where $\pi^* = \pi_{n,M}(\lambda^*)$.

Proof. From ([6], Theorem 2.3) we obtain

1 - TT* < F {|A - A*|i < Bifc*r„,M} < ^ 1^ min, - < ^i^n^/j • 
This immediately implies the result. □ 



Remark 3.1. Notice that $\pi^* \to 0$ as $n \to \infty$ for any $r_{n,M} \ge A\sqrt{\log(Mn)/n}$, and
for $B = B_1$, as needed in Proposition 2.2 in Section 2.2 above.

In order to control the probability (2.8) we first define $U$ and $U_1$, the analogues
of the sets $\Lambda$ and $\Lambda_1$ defined above.

    $U = \left\{ \mu \in \mathbb{R}^{k^*} : \left\| f - \sum_{j \in I^*} \mu_j f_j \right\|^2 \le C_f r_{n,M}^2 \right\}$,    $U_1 = \{ \mu \in U : \rho(\mu) M(\mu) \le C \}$.

Recall that $\mu^*$ is the vector in $\mathbb{R}^{k^*}$ obtained from $\lambda^*$ by deleting the zero entries.
Then, since assumption (A3) implies $\max_{i \in I^*} \max_{j \in I^*, j \ne i} |\rho_M(i, j)| \le C/k^*$ and
$\|f - \sum_{j=1}^M \lambda_j^* f_j\| = \|f - \sum_{j \in I^*} \mu_j^* f_j\|$, we deduce that $\mu^* \in U_1$. Therefore, using
again ([6], Theorem 2.3) applied now to the dictionary $\{f_j\}_{j \in I^*}$ and quantity $\tilde{\mu}$
defined in (2.6) above, we obtain the following corollary:

Corollary 2. Assume that (A1)-(A3) hold. Then

(3.11)    $P\{ |\tilde{\mu} - \mu^*|_1 \le B_2 k^* r_{n,M} \} \ge 1 - p^*$,

where



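    $p^* \le 14 k^{*2} \exp\left( -c_1 n \min\left\{ \frac{r_{n,M}^2}{L_0}, \frac{r_{n,M}}{L^2}, \frac{1}{L_0 k^{*2}}, \frac{1}{L^2 k^*} \right\} \right)$
      $+ \exp\left( -c_2 \frac{k^*}{L^{*2}} n r_{n,M}^2 \right)$,

for some positive constants $c_1, c_2$ as above and a constant $B_2 > 0$ that only depends
on $C_f$ and $c_0$.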



Remark 3.2. If $r_{n,M} \ge A\sqrt{\log(Mn)/n}$, then $M p^* \to 0$ as $n \to \infty$, for $A > 0$
large enough. Hence, the probability given by (2.8), summed over $k$, converges to
zero for both choices of $r_{n,M}$ introduced in Section 2, adjusting the value of $B_2$ if
needed.



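The following lemma is needed in the beginning of the proof of Proposition 2.3.

Lemma 3.4. $\tilde{\lambda} = (\tilde{\mu}, 0)$ is a solution of (2.4) on the set

    $\mathcal{B} = \bigcap_{k \notin I^*} \left\{ \left| \frac{2}{n} \sum_{i=1}^n \left[ Y_i - \sum_{j \in I^*} \tilde{\mu}_j f_j(X_i) \right] f_k(X_i) \right| < 2 r_{n,M} \|f_k\|_n \right\}$.

Proof. We recall that for any convex function $g : \mathbb{R}^M \to \mathbb{R}$ the subdifferential
of $g$ at a point $\lambda$ is the set $D_\lambda = \{ w \in \mathbb{R}^M : g(u) - g(\lambda) \ge \langle w, u - \lambda \rangle \text{ for all } u \in \mathbb{R}^M \}$. Let
$g(\lambda) = \frac{1}{n} \sum_{i=1}^n \{ Y_i - \sum_{j=1}^M \lambda_j f_j(X_i) \}^2 + \mathrm{pen}(\lambda)$, where we recall that our penalty
term is $\mathrm{pen}(\lambda) = 2 r_{n,M} \sum_{j=1}^M \|f_j\|_n |\lambda_j|$. Then (e.g., [13]) we have

    $D_\lambda = \left\{ w \in \mathbb{R}^M : w = -\frac{2}{n} F'(Y - F\lambda) + 2 r_{n,M} v \right\}$,

where $v \in \mathbb{R}^M$ is such that

    $v_k = \|f_k\|_n$, if $\lambda_k > 0$,
    $v_k = -\|f_k\|_n$, if $\lambda_k < 0$,
    $v_k \in [-\|f_k\|_n, \|f_k\|_n]$, if $\lambda_k = 0$,

and where we recall that $Y = (Y_1, \ldots, Y_n)$ and $F$ is the $n \times M$ matrix with elements
$f_j(X_i)$. By standard results in convex analysis, $\lambda \in \mathbb{R}^M$ is a point of local minimum
for a convex function $g$ if and only if $0 \in D_\lambda$, where $0 \in \mathbb{R}^M$. Therefore, $\lambda$ minimizes
our $g(\lambda)$ if and only if $0 \in D_\lambda$ if and only if

    $\left( \frac{2}{n} F'(Y - F\lambda) \right)_k = 2 r_{n,M} v_k$ for all $k \in \{1, \ldots, M\}$,

where $(\cdot)_k$ above denotes the $k$-th component of the vector in parentheses.
Equivalently, $\lambda$ minimizes $g(\lambda)$ if and only if, for all $1 \le k \le M$,

(3.12)    $\left| \frac{2}{n} \sum_{i=1}^n \left[ Y_i - \sum_{j=1}^M \lambda_j f_j(X_i) \right] f_k(X_i) \right| = 2 r_{n,M} \|f_k\|_n$, if $\lambda_k \ne 0$,

          $\left| \frac{2}{n} \sum_{i=1}^n \left[ Y_i - \sum_{j=1}^M \lambda_j f_j(X_i) \right] f_k(X_i) \right| \le 2 r_{n,M} \|f_k\|_n$, if $\lambda_k = 0$.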

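In what follows we find conditions under which $\tilde{\lambda} = (\tilde{\mu}, 0)$, with $\tilde{\mu}$ given in (2.6)
above, satisfies (3.12). First notice that, by definition, $\sum_{i=1}^n [Y_i - \sum_{j=1}^M \tilde{\lambda}_j f_j(X_i)] =
\sum_{i=1}^n [Y_i - \sum_{j \in I^*} \tilde{\mu}_j f_j(X_i)]$. Since $\tilde{\mu}$ is a solution of (2.6) then, by the above standard
results in convex analysis, applied now to the function $h(\mu)$ defined in the proof of
Proposition 2.3, the following hold:

    $\left| \frac{2}{n} \sum_{i=1}^n \left[ Y_i - \sum_{j \in I^*} \tilde{\mu}_j f_j(X_i) \right] f_k(X_i) \right| = 2 r_{n,M} \|f_k\|_n$, if $\tilde{\lambda}_k = \tilde{\mu}_k \ne 0$, $k \in I^*$,

    $\left| \frac{2}{n} \sum_{i=1}^n \left[ Y_i - \sum_{j \in I^*} \tilde{\mu}_j f_j(X_i) \right] f_k(X_i) \right| \le 2 r_{n,M} \|f_k\|_n$, if $\tilde{\lambda}_k = \tilde{\mu}_k = 0$, $k \in I^*$.

Notice now that on the set $\mathcal{B}$ we also have

    $\left| \frac{2}{n} \sum_{i=1}^n \left[ Y_i - \sum_{j \in I^*} \tilde{\mu}_j f_j(X_i) \right] f_k(X_i) \right| < 2 r_{n,M} \|f_k\|_n$, if $k \notin I^*$ (for which $\tilde{\lambda}_k = 0$).

The above displays show that $\tilde{\lambda}$ satisfies condition (3.12) and is therefore a
solution of (2.4) on $\mathcal{B}$. □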
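
The optimality conditions used in this proof are easy to verify numerically for a candidate vector. The sketch below assumes the objective $\frac{1}{n}\sum_i \{Y_i - \sum_j \lambda_j f_j(X_i)\}^2 + 2 r_{n,M} \sum_j \|f_j\|_n |\lambda_j|$, so that the gradient of the quadratic part carries a factor $2/n$; the function name `satisfies_kkt` and the tolerance are hypothetical. For non-zero coordinates it checks the signed subgradient condition, which is slightly stronger than the absolute-value form stated in (3.12).

```python
import numpy as np


def satisfies_kkt(F, Y, lam, r, tol=1e-8):
    """Check the subgradient optimality conditions for a candidate vector lam."""
    n = F.shape[0]
    norms = np.sqrt((F ** 2).mean(axis=0))       # ||f_k||_n
    # k-th entry: (2/n) * sum_i [Y_i - sum_j lam_j f_j(X_i)] f_k(X_i)
    grad = (2.0 / n) * (F.T @ (Y - F @ lam))
    bound = 2.0 * r * norms
    active = lam != 0
    # Non-zero coordinates: equality, with the sign of lam_k.
    ok_active = np.allclose(grad[active], np.sign(lam[active]) * bound[active], atol=tol)
    # Zero coordinates: the absolute value must not exceed the bound.
    ok_zero = np.all(np.abs(grad[~active]) <= bound[~active] + tol)
    return bool(ok_active and ok_zero)
```

As a one-column sanity check, with $f_1 \equiv 1$, $Y_i \equiv 2$ and $r_{n,M} = 0.5$ the minimizer is $2 - 0.5 = 1.5$, and `satisfies_kkt` accepts that value while rejecting the unpenalized least squares value $2$.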

Remark 3.3. The observation that constitutes the statement of the above lemma 
has also been made elsewhere [12] for a slightly different penalty term. We have 
included here a full derivation of it for completeness and clarity. 

To complete the proof of Proposition 2.3 we will make repeated use of Bernstein's 
inequality, which we state here for completeness. 

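Bernstein's inequality. Let $\zeta_1, \ldots, \zeta_n$ be independent random variables such that

    $\frac{1}{n} \sum_{i=1}^n E|\zeta_i|^m \le \frac{m!}{2} w^2 d^{m-2}$,

for some positive constants $w$ and $d$ and for all integers $m \ge 2$. Then, for any $\varepsilon > 0$
we have

(3.13)    $P\left\{ \sum_{i=1}^n (\zeta_i - E\zeta_i) \ge n\varepsilon \right\} \le \exp\left( -\frac{n\varepsilon^2}{2(w^2 + d\varepsilon)} \right)$.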
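
As a numerical illustration of (3.13), the bound can be compared with a simulated tail frequency. The example below is a sketch under assumptions of our own choosing: the variables are uniform on $[-1, 1]$, for which $E|\zeta_i|^m = 1/(m+1) \le \frac{m!}{2} \cdot \frac{1}{3}$, so $w^2 = 1/3$ and $d = 1$ satisfy the moment condition; the function names are hypothetical.

```python
import math
import random


def bernstein_bound(n, eps, w, d):
    """Right-hand side of (3.13): exp(-n * eps^2 / (2 * (w^2 + d * eps)))."""
    return math.exp(-n * eps ** 2 / (2.0 * (w ** 2 + d * eps)))


def empirical_tail(n, eps, reps=2000, seed=0):
    """Monte Carlo frequency of {sum_i zeta_i >= n * eps} for zeta_i uniform on
    [-1, 1], so that E zeta_i = 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        total = sum(rng.uniform(-1.0, 1.0) for _ in range(n))
        if total >= n * eps:
            hits += 1
    return hits / reps


# Uniform[-1, 1]: w^2 = 1/3 and d = 1 satisfy the moment condition above.
bound = bernstein_bound(n=100, eps=0.2, w=math.sqrt(1.0 / 3.0), d=1.0)
freq = empirical_tail(n=100, eps=0.2)
print(f"bound = {bound:.4f}, simulated frequency = {freq:.4f}")
```

The simulated frequency should fall well below the bound, which is not tight for this distribution.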

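Lemma 3.5. Let assumptions (A1) and (A2) hold. Then

    $\sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_1^c(k)) \to 0$,    $\sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_2^c(k)) \to 0$, and

    $\sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_3^c(k)) \to 0$, as $n \to \infty$.

Proof. To show $\sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_1^c(k)) \to 0$ it is enough to show that
$(I) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_1^c(k) \cap E_2(k)) \to 0$ and that
$(II) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_2^c(k)) \to 0$.
The proofs follow immediately from Bernstein's inequality and the union bound.
They are the same as ([6], proofs of Lemmas 4 and 5, page 186). We include here
the derived probability bounds, for completeness.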



(/) < 2M2 exp 



' n.M 



166 



2M2 exp - 



nrn,Aico 
8V2L 



2M^ exp - 



12L^ 



and 
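To bound the last quantity in the statement of the Lemma notice first that

    $P(E_3^c(k)) \le \sum_{j \in I^*} P\left( \left| \frac{1}{n} \sum_{i=1}^n f_j(X_i) f_k(X_i) - \langle f_j, f_k \rangle \right| \ge |\langle f_j, f_k \rangle| + \delta_{n,M} \right)$

    $\le 2 \sum_{j \in I^*} \exp\left\{ -\frac{n}{4L_0} (|\langle f_j, f_k \rangle| + \delta_{n,M})^2 \right\}$
      $+ 2 \sum_{j \in I^*} \exp\left\{ -\frac{n}{4L^2} (|\langle f_j, f_k \rangle| + \delta_{n,M}) \right\}$

    $\le 2M \exp\left\{ -\frac{n \delta_{n,M}^2}{4L_0} \right\} + 2M \exp\left\{ -\frac{n \delta_{n,M}}{4L^2} \right\}$.

The second inequality of the display above follows from Bernstein's inequality with
$\zeta_i = f_j(X_i) f_k(X_i)$, for every fixed $j$ and $k$, and with $w^2 = L_0$, $d = L^2$, for $\varepsilon =
|\langle f_j, f_k \rangle| + \delta_{n,M}$, used together with the inequality $e^{-x/(\alpha+\beta)} \le e^{-x/(2\alpha)} + e^{-x/(2\beta)}$
for $x \ge 0$ and $\alpha, \beta > 0$. Therefore, for $\delta_{n,M} = 2CL^2 r_{n,M}$ we obtain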
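
The elementary inequality used to split the Bernstein exponent holds because $\alpha + \beta \le 2\max(\alpha, \beta)$. The sketch below checks it on a small grid and evaluates the resulting two-term bound; the function names are hypothetical, and the grid check is an illustration rather than a proof.

```python
import math


def split_is_valid(x, a, b):
    """Check exp(-x/(a+b)) <= exp(-x/(2a)) + exp(-x/(2b)) for x >= 0 and a, b > 0."""
    return math.exp(-x / (a + b)) <= math.exp(-x / (2 * a)) + math.exp(-x / (2 * b))


def two_term_bound(n, eps, L0, L):
    """exp(-n eps^2 / (4 L0)) + exp(-n eps / (4 L^2)): the split form of the
    Bernstein bound with w^2 = L0 and d = L^2 (take a = 2 L0, b = 2 L^2 eps)."""
    return math.exp(-n * eps ** 2 / (4 * L0)) + math.exp(-n * eps / (4 * L ** 2))


grid = [0.0, 0.1, 1.0, 10.0, 100.0]
scales = [0.01, 1.0, 50.0]
all_ok = all(split_is_valid(x, a, b) for x in grid for a in scales for b in scales)
print(all_ok, two_term_bound(n=100, eps=0.2, L0=1.0 / 3.0, L=1.0))
```

The two-term form is looser than the original exponent but separates the quadratic and linear regimes in $\varepsilon$, which is what makes the final bounds easy to read.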

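    $(III) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_3^c(k))$

    $\le 2M^2 \exp\left\{ -\frac{n C^2 L^4 r_{n,M}^2}{L_0} \right\} + 2M^2 \exp\left\{ -\frac{n C r_{n,M}}{2} \right\}$.

Thus, the quantities $(I)$, $(II)$ and $(III)$ converge to zero for any
$r_{n,M} \ge A\sqrt{\log(Mn)/n}$. □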

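Lemma 3.6. Let assumptions (A1) and (A2) hold. Then

    $(IV) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P\left( \frac{1}{n} \sum_{i=1}^n |f(X_i) - f^*(X_i)| \ge \frac{c_0}{8L} r_{n,M} \right) \to 0$.

Proof. By the Cauchy-Schwarz inequality we have

    $(IV) \le \sum_{k \in \{1,\ldots,M\} \setminus I^*} P\left( \frac{1}{n} \sum_{i=1}^n (f(X_i) - f^*(X_i))^2 \ge \frac{c_0^2}{64L^2} r_{n,M}^2 \right)$

(3.15)    $\le \sum_{k \in \{1,\ldots,M\} \setminus I^*} P\left( \frac{1}{n} \sum_{i=1}^n \left\{ (f(X_i) - f^*(X_i))^2 - \|f - f^*\|^2 \right\} \ge \frac{c_0^2}{64L^2} r_{n,M}^2 - \|f - f^*\|^2 \right)$

    $\le \sum_{k \in \{1,\ldots,M\} \setminus I^*} P\left( \frac{1}{n} \sum_{i=1}^n \left\{ (f(X_i) - f^*(X_i))^2 - \|f - f^*\|^2 \right\} \ge C_1 r_{n,M}^2 \right)$,

where we recall that $\|f - f^*\|^2 \le C_f r_{n,M}^2$ by definition and $C_1 = c_0^2/(64L^2) - C_f$,
where we assume that we have already adjusted $C_f$ to have $C_1 > 0$, by taking
an appropriate constant $A$ in the definition of $r_{n,M}$, if needed. The proof follows
immediately from Bernstein's inequality applied to $\zeta_i = [f(X_i) - f^*(X_i)]^2$ with
$w = \sqrt{C_f}\, r_{n,M}$ and $d = L^*$, and for $\varepsilon = C_1 r_{n,M}^2$. Therefore

    $(IV) \le M \exp\left\{ -\frac{C_1^2}{4C_f} n r_{n,M}^2 \right\} + M \exp\left\{ -\frac{C_1}{4L^*} n r_{n,M}^2 \right\}$

and both terms converge to zero for either choice of $r_{n,M}$. □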